Name: Zhixin Tang (914021146)
Coronavirus disease (COVID-19) is a worldwide pandemic that began in 2019 till now. In the last two years, a lot of people have had Coronavirus, and a lot of people have even died because of Coronavirus. For all countries, Coronavirus disease is a big challenge because Coronavirus disease is a highly contagious viral virus. Now, there is no vaccine that can completely conquer Coronavirus. No one knows how long Coronavirus disease will last. Therefore, to help normal people who do not have any background to better understand Coronavirus and further to find a way to reduce Coronavirus cases, statistical analysis is particularly important. In this report, I first introduce Coronavirus disease in the background part. I mainly use the data set by the World Health Organization to show people an overview of Coronavirus disease in different countries, such as the number of cases, the number of deaths, and the case-mortality rates in different countries. The dataset can be downloaded from https://covid19.who.int/WHO-COVID-19-global-data.csv. Then, I will find out which kinds of countries have relatively higher cases of Coronavirus. As it is not possible to analyze all the countries, I choose the first 5 countries which have the highest Coronavirus cases and deaths. Then, I compare the strictness of policies, capital GPD, hospital beds per thousand of these countries, and the total population of these countries to find if there exist some similarities or differences between these countries. The measurements are from the second data set: https://ourworldindata.org/covid-vaccinations.
In response to COVID-19, countries around the world have gradually changed from the initial policy of closing cities and zeroing to coexisting and dying with the virus. The most important factor in this is the research and development, promotion and popularization of the COVID-19 vaccine. By injecting the COVID-19 vaccine, the body’s immune system will acquire the corresponding antibodies, so that when an individual is exposed to the COVID-19 virus, the individual will improve their own chances of survival because they have been vaccinated. Also, while vaccines are very effective at preventing serious infection and death, they are not 100 percent effective at stopping the acquisition and spread of the virus. Especially in areas with low vaccination rates and high infection rates, vaccinated people are more likely to be exposed to the virus and get breakthrough infections. And given the highly contagious mutation, countries around the world consider that periodic boosters will need to be implemented in the coming years.
However, some people have been questioning the safety and effectiveness of vaccines. It is mainly reflected in the slight side effects of the vaccinated population, and the risk of contracting COVID-19 after receiving the vaccine. Thus, checking the effectiveness of Coronavirus vaccines is especially important. For analyzing the effectiveness of Coronavirus vaccines, I mainly compare the new deaths each day, icu patients per million with the number of people taking at least one dose of vaccine to see if vaccines decrease death cases. As there are too many samples and to reduce the potential problems, I will mainly focus on the United States. United States has the highest GPD and advanced medical equipment. Moreover, according to common sense, United States is the epicenter of the Coronavirus disease. The dataset I will use for the Coronavirus vaccine is from the Website “Our World in Data”, https://ourworldindata.org/covid-vaccinations.
To make my report clearly, I post some question I want to figure out in the following analysis.
What is the trend of cases of Coronavirus in different countries and what is the trend of fatality rate of different countries?
Which countries have relatively higher cases of Coronavirus, and which countries have relatively lower cases of Coronavirus? What are the similarities of these countries?
What is the effectiveness of vaccines to reduce the death cases and severe cases in the United States.
A novel coronavirus (CoV) named ‘2019-nCoV’ or ‘2019 novel coronavirus’ or ‘COVID-19’ by the World Health Organization (WHO) is in charge of the current outbreak of pneumonia that began at beginning of December 2019 near Wuhan City, Hubei Province, China\({^{[1]}}\). Since 2019, Coronavirus has not been resolved until now and may never be resolved. People in the future may need to find a way to reduce the effects of Coronavirus on people’s lives and coexist with Coronavirus. Thus, better understanding the features of Coronavirus and making effective vaccines are the primary tasks for people to combat Coronavirus. Coronavirus is a virus around the lungs and it is very similar to Severe Acute Respiratory Syndrome (SARS). The symptoms of Coronavirus can be fever, cough, shortness of breath, pneumonia, or breathing difficulties\({^{[2]}}\). For the recent detect of Coronavirus, Omicron, which was first detect in South Africa in November 2021, the symptoms are mainly sore throat and headache. Coronavirus is a virus that propagates fast. COVID-19 spreads mainly through three ways: Inhalation of air when near infected people who inhale small particles of virus; small particles containing Coronavirus on eyes, nose, and mouth; Use the hand with Coronavirus to touch eyes, nose, and mouth. From the World Health Organization Website, there are 416614051 confirmed cases, 5844097 confirmed deaths, and 10279668555 vaccine doses administered as of 14th February, 2022. The total death rate is about 1.403%. The death rate seems very low, but according to the fast spread of Coronavirus, the low death rate still means a huge number of cases of deaths.
Coronavirus not only influences people’s health, but also influences the economy of different countries. In order to minimize the propagation of COVID-19, a lot of countries reduce the number of international flights. Also, more and more people are not willing to travel to other countries, which has dealt a heavy blow to countries that rely on tourism as an important income. The World Bank report forecasted a 5.25 contraction in global GPD in 2020\({^{[3]}}\). A lot of companies and restaurants are bankrupt because of Coronavirus disease.
Coronavirus death cases in different ranges of ages are different. Specifically, from the data of deaths in the United States, as of February 9, 2022, there are 231722 death cases of 85 years and older people, 230870 death cases of 75-84 years old people, 205457 death cases of 65-74 years old people, 168765 deaths cases of 50-64 years old people, 39166 death cases of 40-49 years old people, 16343 death cases of 30-39 years old people, 5581 death cases of 18-29 years old people, and 795 death cases of 0-17 years old people. The death cases in the United States consistently decrease when the age of people decreases. The reasons that death cases in older people are higher may be that Coronavirus can cause some other diseases and it can attack the immune systems of people.
| Variable_name | Type | Description |
|---|---|---|
| Date_reported | Date | Date of reporting to WHO |
| Country_code | String | ISO Alpha-2 country code |
| Country | String | Country, territory, area |
| WHO_region | String | WHO regional offices: WHO Member States are grouped into six WHO regions–Regional Office for Afria (AFRO), Regional Office for the Americas (AMRO), Regional Office for South-East Asia (SEARO), Regional Office for Europe (EURO), Regional Office for Eastern Mediterranean (EMRO), and Regional Office for the Western Pacific (WPRO). |
| New_cases | Integer | New confirmed cases. Calculated by subtracting previous cumulative case count form current cumulative cases count |
| Cumulative_cases | Integer | Cumulative confirmed cases reported to WHO to data |
| New_deaths | Integer | New confirmed deaths. Calculated by subtracting previous cumulative deaths from current cumulative deaths |
| Cumulative_deaths | Integer | Cumulative confirmed deaths reported to WHO to data |
| Variable_name | Type | Description |
|---|---|---|
| data | Data | Data of reporting |
| location | categorical | The name of different country |
| new_deaths | numerical | number of new deaths in different countries each day |
| icu_patients_per_million | numerical | patients per million in ICU |
| people_vaccinated | numerical | Number of people who takes vaccine |
| people_fully_vaccinated | numerical | Number of people fully vaccinated |
| total_boosters | numerical | Number of people who take boosters |
| stringency_index | numerical | The strickness of policies in different countries (level 0-10) |
| hospital_beds_per_thousand | numerical | Number of beds per thousand of hospital in different countries |
| gdp_per_capita | numerical | gda per capita in different countries |
In this project, I cannot analyze the effectiveness of Vaccine in all the countries. Thus, I only choose one representative countries for the Vaccine Analysis. I choose to focus on the United States of America. The United States of America has the highest Coronavirus cases and deaths. Moreover, the medical level and gdp in the United States is one of the best in the world. Therefore, people may be unlikely to die because of the limited medical resources, money, and so on than in other countries.
COVID-19 daily deaths and severe cases in U.S.
Here, the visualizations are used to explore the relationship between the daily COVID-19 situation in the United States, the state of vaccinations in the United States, and the two. From figure 1, it can be found that from July 2020 to February 2022, the new deaths of COVID-19 cases fluctuated in July 2020, January 2021, August 2021 and February 2022, respectively. three peaks. Correspondingly, the daily admission of severe COVID-19 patients in the ICU peaked in January 2021. From August 2021 to February 2022, the daily admission of critically ill patients in the ICU showed a high fluctuation. This two plots clearly show that there are around 3 peak of new deaths and ICU patients in the United States. These peaks may cause by mutations of Coronavirus. Moreover, when people gradually take vaccines, they may feel they immune for Coronavirus and they may not keep distance to each other and may not wear mask. Then, they still may get Coronavirus.
Number of people vaccinated in the U.S.
The U.S. COVID-19 vaccine policy began in December 2020. It can be seen from the figure that the number of vaccinated people increased rapidly from 2021, and by June 2021, the increase in the number of vaccinated people gradually slowed down. The number of people fully vaccinated with the three-shot vaccine also increased rapidly until June 2021, and then gradually slowed down. The number of boosters vaccinated has been increasing since August 2021, and the growth has been relatively rapid.
Vaccination Effectiveness
In this project, we are most concerned about the effectiveness of the COVID-19 vaccine. Here I use scatter-plots to explore the relationship between vaccinated numbers and daily new deaths, and vaccinated numbers versus ICU admissions . It can be seen from the figure that with the implementation of the vaccination policy, the number of daily deaths has dropped sharply, but with the increase in the number of vaccinations, the number of daily deaths has risen sharply again with great fluctuations. The relationship between the number of ICU admissions and the number of vaccinations also showed a rapid decline with the increase of the number of vaccinations. However, when the number of vaccinations reached a certain scale, the number of ICU admissions showed a sharp rebound.
Correlation matrix
For a more comprehensive evaluation of the vaccine data, I created a new variable Is Vaccinated, which is a training variable indicating whether or not to start vaccination. Here, I use a correlation matrix to explore the correlation between variables. The ellipse in the figure represents the correlation, and the numbers represent the correlation coefficient. The flatter the ellipse, the stronger the correlation. Red represents negative correlation, blue represents positive correlation. Blank squares in the figure indicate that the correlation is not significant. It can be seen from the figure that there is a very strong positive correlation between the number of vaccinations and the number of fully vaccinated (three shots), indicating that there may be severe collinearity in subsequent modeling. So we need to consider using one of these two variables as an explanatory variable. ICU patients showed a strong positive correlation with daily deaths, which means that once infected with COVID-19, once they enter the ICU, they are likely to die of the disease. In addition, it can be found from the figure that the number of daily deaths is weakly negatively correlated with the scale of vaccination, while the correlation between the number of ICU admissions and the scale of vaccination is not significant.
In this project, I am interested in the daily number of deaths and patients admitted to the ICU. Here I describe the distribution of these two variables. As can be seen from the figure, daily deaths are heavily right-skewed, while daily ICU admissions show a weaker right-skewed distribution. And the values of these two variables are larger. This means that in the subsequent linear regression, I need to log-transform these two variables. In addition, it can be found from the figure that after the logarithmic transformation, the variation of the variables of the data is small, and it is more similar to the normal distribution.
Regression analysis was mainly used in this project to explore the effectiveness of vaccines. Regression analysis is a very simple and effective model. Its data requirements are simple and its results are simple to understand. Also the data I’m using has a temporal nature, so I need to account for its lag correlation. From the previous analysis, it can be found that there is a strong lag correlation between daily deaths and daily ICU cases. So I need to take its lagged series into the model for consideration.
It usually takes several weeks after COVID-19 vaccination to produce T and B lymphocytes. Therefore, because the vaccine does not have enough time to provide protection, it is possible to contract the virus that causes COVID-19 before or after the vaccination, and then get sick. Here I lag the vaccination data by 14 days and 21 days and rematch it with the number of COVID-19 infections for modeling analysis. Overall, I will make two models. In the first model, I will lag the vaccination data by 14 days, and in the second model, I will lag the vaccination data by 21 days.
In the first model, I lag the vaccination data by 14 days. The previous correlation matrix tells us that there is an extremely strong correlation between the two indicators of vaccination. Fully incorporating these two metrics into a linear regression model can lead to severe collinearity. In real life, either of these two metrics can tell a story about the status of vaccinations. Here, I built a regression model of daily deaths using only the number of fully vaccinated and daily ICU cases as explanatory variables. In addition, I created a corresponding dummy variable Is_vaccinated to consider the effect of implementing the vaccine policy. Since the data are time series, there may be correlations between the response variables. As a result, the model residuals will appear autocorrelation. When the model residuals are auto-correlated, the OLS estimator is still unbiased and consistent, and when the sample size is large, the estimator still obeys the normal distribution, but the estimation is not an effective estimation. Therefore I need to test the autocorrelation of the residuals. The Durbin Watson test shows a p-value of 2.2e-16 and less than 0.05, indicating that there is autocorrelation in the model, so I need to correct the model’s estimates accordingly.
##
## Durbin-Watson test
##
## data: mod1
## DW = 1.145, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.90489070 0.22943809 -3.9439 9.029e-05
## log(icu_patients) 1.26141770 0.10273860 12.2779 < 2.2e-16
## people_fully_vaccinated_per_hundred -0.00239692 0.00039457 -6.0748 2.289e-09
## total_boosters_per_hundred 0.00328070 0.00093578 3.5058 0.0004915
## Is_Vaccinated 0.06825685 0.01203533 5.6714 2.268e-08
##
## (Intercept) ***
## log(icu_patients) ***
## people_fully_vaccinated_per_hundred ***
## total_boosters_per_hundred ***
## Is_Vaccinated ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The preceding analysis shows that there is an autocorrelation between the residuals. At the same time to avoid heteroscedasticity. When estimating the covariance matrix between coefficients, I use heteroscedasticity and autocorrelation consistent (HAC) covariance matrix estimation. The model results are as follows, where the model formula is :\[log(new \_death_t) = -0.90 - 0.002 people_fully \ vaccinated_per \ hundred_t + 1.26 log(ICU \ patients_t) \\+ 0.003 total\_boosters\_per\_hundred + 0.068Is\_Vaccinated\]
The coefficient of log(icu_patients) is 1.26, which means that for every 1% increase in ICU admission, the number of log daily deaths increases by about 1.26%, holding other variables fixed. The effect of icu_patients is significant, which means that once COVID-19 infected patients enter the ICU for treatment, they will have a high probability of death. The coefficient of people_fully_vaccinated_per_hundred is -0.002, which implies that for every 10 per hundred more people fully vaccinated, the log daily death toll would fall by about 2%, holding other variables fixed. The effect is significant, which shows that the implementation of the vaccine can effectively reduce the mortality rate of the population. The coefficient of total_boosters_per_hundred is 0.003, which means that for every 10 additional 100 people vaccinated with booster shots, the log daily death toll will increase by about 3%, holding other variables fixed. The effect is significant, possibly because booster needles reduce people’s self-confidence, which increases the chance of infection. The coefficient of Is_vaccinated is 0.068, which suggests that implementing a vaccine policy would increase the number of log daily deaths by about 6.8%, holding other variables fixed. and its effect is significant.
Residual Diagnostics for Regression of Daily Deaths
To increase confidence in the above analysis, a residual analysis is performed here. The upper left graph is a scatterplot of residuals versus fitted values. The residuals in the figure are randomly distributed uniformly and there is no trend between the residuals and the fitted values, which suggests that the residuals satisfy the conditional zero-mean assumption. The upper right plot shows that the residuals are distributed around the theoretical line, and which implies that the residuals are normally distributed. The residuals in the lower left plot show some fluctuating clustering, which shows the residuals are heteroskedasticity. The bottom right graph shows that there are no outliers. This shows that our estimates of the coefficients are unbiased estimates, but their p-values may be amplified due to heteroskedasticity in the residuals. Therefore, we need to use heteroscedasticity-consistent covariance matrix estimation when presenting the model (in fact, I used this estimation in the previous part of the model presentation).
In the second model, I lag vaccination data by 21 days. The Durbin Watson test shows a p-value of 2.2e-16 and less than 0.05, indicating that there is autocorrelation in the model, so I also need to correct the model’s estimates accordingly. The results of Durbin Watson test in these two models are similar.
##
## Durbin-Watson test
##
## data: mod2
## DW = 1.1272, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.85435885 0.24079007 -3.5481 0.0004208
## log(icu_patients) 1.23884839 0.10800600 11.4702 < 2.2e-16
## people_fully_vaccinated_per_hundred -0.00236140 0.00041961 -5.6276 2.903e-08
## total_boosters_per_hundred 0.00312709 0.00094428 3.3116 0.0009880
## Is_Vaccinated 0.06756793 0.01429748 4.7259 2.908e-06
##
## (Intercept) ***
## log(icu_patients) ***
## people_fully_vaccinated_per_hundred ***
## total_boosters_per_hundred ***
## Is_Vaccinated ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The preceding analysis shows that there is an autocorrelation between the residuals. At the same time to avoid heteroscedasticity. When estimating the covariance matrix between coefficients, I also use heteroscedasticity and autocorrelation consistent (HAC) covariance matrix estimation. The model results are as follows, where the model formula is :\[log(new \_death_t) = -0.85 - 0.002 people_fully \ vaccinated_per \ hundred_t + 1.24 log(ICU \ patients_t) \\+ 0.003 total\_boosters\_per\_hundred + 0.067Is\_Vaccinated\]
The coefficient of log(icu_patients) is 1.23, which means that for every 1% increase in ICU admission, the number of log daily deaths increases by about 1.23%, holding other variables fixed. The effect of icu_patients is significant, which means that once COVID-19 infected patients enter the ICU for treatment, they will have a high probability of death. The coefficient of people_fully_vaccinated_per_hundred is -0.002, which implies that for every 10 per hundred more people fully vaccinated, the log daily death toll would fall by about 2%, holding other variables fixed. The effect is significant, which shows that the implementation of the vaccine can effectively reduce the mortality rate of the population. The coefficient of total_boosters_per_hundred is 0.003, which means that for every 10 additional 100 people vaccinated with booster shots, the log daily death toll will increase by about 3%, holding other variables fixed. The effect is significant, possibly because booster needles reduce people’s self-confidence, which increases the chance of infection. The coefficient of Is_vaccinated is 0.067, which suggests that implementing a vaccine policy would increase the number of log daily deaths by about 6.7%, holding other variables fixed. and its effect is significant.
Residual Diagnostics for Regression of Daily Deaths
To increase confidence in the above analysis, a residual analysis is performed here. The upper left graph is a scatterplot of residuals versus fitted values. The residuals in the figure are randomly distributed uniformly and there is no trend between the residuals and the fitted values, which suggests that the residuals satisfy the conditional zero-mean assumption. The upper right plot shows that the residuals are distributed around the theoretical line, and which implies that the residuals are normally distributed. The residuals in the lower left plot show some fluctuating clustering, which shows the residuals are heteroskedasticity. The bottom right graph shows that there are no outliers. This shows that our estimates of the coefficients are unbiased estimates, but their p-values may be amplified due to heteroskedasticity in the residuals. Therefore, we need to use heteroscedasticity-consistent covariance matrix estimation when presenting the model.
The results from this two models are very similar to each other. Therefore, the difference between lagging vaccination by 14 days and lagging vaccination by 21 days is small, and we can use any one of the model.
As my treatments for the two models are numerical variables except Is_vaccinated, it is hard to make assumptions to test the causal effect of vaccine and deaths of Coronavirus. Therefore, I just briefly explain the causality between vaccine and deaths. In this two model, I mainly investigate the effectiveness of vaccine and I lag vaccination data by 14 days and 21 days because vaccine takes time to work. Thus, I merge vaccine data with 14-day and 21-day lag death data. According the two model, vaccine and deaths have significant relationship. I think the deaths cases are influenced by vaccine.
In this part, I also want to find the relationship between the number of people with severe Coronavirus and number of people taking vaccine. I choose the variable icu patients as the representation for the number of people with severe Coronavirus. I use the data with lagging vaccination data by 14 days.
##
## Durbin-Watson test
##
## data: mod3
## DW = 0.0049615, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0
Based on the previous exploratory analysis, I introduced a quadratic term in the regression model of ICU cases. The Durbin Watson test shows that the model has residual autocorrelation.
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.23299061 0.00034307 6508.8288 < 2.2e-16
## people_fully_vaccinated_per_hundred -0.00144589 0.00015681 -9.2206 < 2.2e-16
## total_boosters_per_hundred 0.00404973 0.00029615 13.6746 < 2.2e-16
## Is_Vaccinated 0.05428179 0.00432469 12.5516 < 2.2e-16
##
## (Intercept) ***
## people_fully_vaccinated_per_hundred ***
## total_boosters_per_hundred ***
## Is_Vaccinated ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here, I also use heteroscedasticity and autocorrelation consistent (HAC) covariance matrix estimation.The model results are as follows, where the model formula is :\[log(icu \ patients_t) = 2.23 - 0.001 people_fully \ vaccinated_per \ hundred_t + 0.004 total\_boosters\_per\_hundred \\+ 0.054 Is\_Vaccinated\]
The coefficient of people_fully_vaccinated_per_hundred is -0.001, which means that for every ten person per hundred people who are fully vaccinated, the log of average ICU admission population decreases by about 1%, holding other variables fixed. The coefficient of total_boosters_per_hundred is 0.004, indicating that for each additional ten person per 100 people receiving booster shots, the log of average ICU admission population decreases by about 4%, holding other variables fixed. The coefficient of Is_Vaccinated is 0.054, indicating that after the implementation of the vaccine policy, the log of average ICU admission population increased by about 5.4%, holding other variables fixed.
Residual Diagnostics for Regression of ICU Patients
The residual plots of the model looks very strange. From the figure, it can be found that the model residuals satisfy the conditional zero mean assumption, normal distribution assumption. However, the residuals seem not randomly distributed. There are some possible reasons that these residual plots look strange. The variable icu patients may have some potential problems. The number of icu patients relates to the number of hospitals, the number of beds in the hospitals, the number of doctors in the hospitals. If a hospital is full, even there are some more people with severe Coronavirus, they cannot move to the ICU in this hospital. Therefore, the number of icu patients sometimes cannot represent the number of people with severe Coronavirus.
Coronavirus is the worldwide Epidemic. Different countries have different situations for COVID-19. For instance, the United States, India, the United Kingdom, Russia, and France face serious problems with Coronavirus, since these countries have relatively higher Coronavirus cases and deaths. However, the case fatality rate in these countries is not high. Therefore, there may not exist a relationship between the number of cases and deaths of Coronavirus and the case fatality rate. In further analysis, I find that the countries with a higher number of cases and deaths of Coronavirus are likely to have higher populations. Moreover, the countries with the lower case fatality rate of Coronavirus seem to have higher gdp per capita and a higher number of hospital beds per thousand.
For the vaccine part, through previous research, it was found that the implementation of the vaccine policy will significantly increase the number of ICU admissions and the number of daily deaths, and the increase in the number of vaccinations will significantly reduce the number of daily deaths and the number of ICU admissions. The increase in the number of booster shots will significantly increase the number of daily deaths and the number of ICU admissions. This is mainly due to the fact that countries around the world have changed from a policy of closing cities and clearing the city to coexisting with the virus. And the vaccine as its strong policy. Countries open their borders by implementing vaccine policies, which increases foreign risks. In other words, not because the implementation of the vaccine policy poses a high risk. Rather, it is because of the opening of borders at the same time as the implementation of the vaccine policy, which increases the risk of contracting COVID-19, thereby increasing the number of deaths and the number of ICU admissions. At the same time, the booster needle appeared in response to the newly mutated COVID-19 virus. In other words, the booster needle was implemented because the newly mutated COVID-19 virus has caused us huge losses and seriously threatened our lives. Therefore, in general, the implementation of vaccines can significantly reduce the probability of severe illness and death.
In this project, I used only four months of observational data. The sample size may not be sufficient to reflect the real situation. In the sample data, the number of people who were fully vaccinated was between 59 and 64. This interval is too narrow to effectively observe the efficacy of the vaccine. In addition, according to scientific guidance, vaccines need to reach a certain coverage to be effective. Also, I only used the number of people who were fully vaccinated as an explanatory variable, which may have left out some important variables. Therefore, in order to draw more valid, credible and scientific conclusions, I not only need data from longer observation periods, but also need to use other valid information to think about, and consider using other higher-order models.
[1] Artificial Intelligence for Coronavirus Outbreak. 2020 Jun 23 : 1–22. Published online 2020 Jun 23. doi: 10.1007/978-981-15-5936-5_1
[2] Key messages and actions for prevention and control in schools Key Messages and Actions for COVID-19 Prevention and Control in Schools (2020) (March), p. 13. Available at: https://www.who.int/docs/default-source/coronaviruse/key-messages-and-actions-for-covid-19-prevention-and-control-in-schools-march-2020.pdf?sfvrsn=baf81d52_4#:∼:text=COVID-19isa,2019-nCoV
[3] The Global Economic Outlook During the COVID-19 Pandemic: A Changed World (2020). Available at: https://www.worldbank.org/en/news/feature/2020/06/08/the-global-economic-outlook-during-the-covid-19-pandemic-a-changed-world (Accessed: 23 November 2020).
knitr::opts_chunk$set(echo = TRUE)
library(gridExtra)
library(scales)
library(lubridate)
library(ggplot2)
library(MASS)
library(dplyr)
library(knitr)
library(kableExtra)
covid <- read.csv(file = "WHO-COVID-19-global-data.csv",stringsAsFactors = TRUE)
vaccine <- read.csv(file = "owid-covid-data.csv")
class(vaccine$location)
vaccine$location <- as.factor(vaccine$location)
levels(vaccine$location)
vaccine_df <- vaccine[vaccine$location=="United States",]
variable_table_1 <- data.frame(Variable_name=c("Date_reported","Country_code","Country","WHO_region","New_cases","Cumulative_cases","New_deaths","Cumulative_deaths"), Type=c("Date","String","String","String","Integer","Integer","Integer","Integer"), Description=c("Date of reporting to WHO","ISO Alpha-2 country code","Country, territory, area", "WHO regional offices: WHO Member States are grouped into six WHO regions--Regional Office for Afria (AFRO), Regional Office for the Americas (AMRO), Regional Office for South-East Asia (SEARO), Regional Office for Europe (EURO), Regional Office for Eastern Mediterranean (EMRO), and Regional Office for the Western Pacific (WPRO).", "New confirmed cases. Calculated by subtracting previous cumulative case count form current cumulative cases count","Cumulative confirmed cases reported to WHO to data","New confirmed deaths. Calculated by subtracting previous cumulative deaths from current cumulative deaths","Cumulative confirmed deaths reported to WHO to data"))
kable(variable_table_1) %>% kable_styling(bootstrap_options = c("striped","hover"))
variable_table_2 <- data.frame(Variable_name=c("data","location","new_deaths","icu_patients_per_million","people_vaccinated","people_fully_vaccinated","total_boosters","stringency_index","hospital_beds_per_thousand","gdp_per_capita"), Type=c("Data","categorical","numerical","numerical","numerical","numerical","numerical","numerical","numerical","numerical"), Description=c("Data of reporting","The name of different country","number of new deaths in different countries each day","patients per million in ICU", "Number of people who takes vaccine","Number of people fully vaccinated","Number of people who take boosters", "The strickness of policies in different countries (level 0-10)", "Number of beds per thousand of hospital in different countries","gda per capita in different countries"))
kable(variable_table_2) %>% kable_styling(bootstrap_options = c("striped","hover"))
summary(covid)
covid$Date_reported <- as.Date(covid$Date_reported)
covid$Country_code <- as.factor(covid$Country_code)
covid$Country <- as.factor(covid$Country)
covid$WHO_region <- as.factor(covid$WHO_region)
covid$New_cases <- as.numeric(covid$New_cases)
covid$Cumulative_cases <- as.numeric(covid$Cumulative_cases)
covid$New_deaths <- as.numeric(covid$New_deaths)
covid$Cumulative_deaths <- as.numeric(covid$Cumulative_deaths)
covid %>% group_by(Country) %>%
summarise(mean_new_case=mean(New_cases),median_new_case=median(New_cases),total_cases=max(Cumulative_cases)) %>% arrange(desc(mean_new_case))
covid %>% group_by(Country) %>%
summarise(mean_new_deaths=mean(New_deaths),median_new_deaths=median(New_deaths),total_deaths=max(Cumulative_deaths)) %>% arrange(desc(total_deaths))
covid_new <- covid %>% group_by(Country) %>%
summarise(mean_new_case=mean(New_cases),median_new_case=median(New_cases),total_cases=max(Cumulative_cases), mean_new_deaths=mean(New_deaths),median_new_deaths=median(New_deaths),total_deaths=max(Cumulative_deaths)) %>% arrange(desc(mean_new_case))
covid_new <- covid_new[-c(228:237),]
covid_new$case_fatality_rate <- covid_new$total_deaths/covid_new$total_cases
covid_new %>% dplyr::select(Country , case_fatality_rate) %>% arrange(desc(case_fatality_rate))
options(warn = -1)
fig.spaghetti.1 <- covid %>%
filter(Date_reported>= "2020-01-03", Date_reported<= "2022-02-11", Country==c("United States of America", "Brazil","India","Russian Federation","Mexico","Peru","France","the United Kingdom","Yemen","Sudan","Syrian Arab Republic","Egypt")) %>%
mutate(Date=as.Date(Date_reported)) %>%
ggplot(aes(x=Date,y=New_cases,by=Country)) +
geom_line(aes(color=Country))+
ggtitle("New cases v.s. Time")
fig.spaghetti.1
options(warn = -1)
fig.spaghetti.2 <- covid %>%
filter(Date_reported>= "2020-01-03", Date_reported<= "2022-02-11", Country==c("United States of America", "Brazil","India","Russian Federation","Mexico","Peru","France","the United Kingdom","Yemen","Sudan","Syrian Arab Republic","Egypt")) %>%
mutate(Date=as.Date(Date_reported)) %>%
ggplot(aes(x=Date,y=New_deaths,by=Country)) +
geom_line(aes(color=Country))+
ggtitle("New deaths v.s. Time")
fig.spaghetti.2
c <- which(covid$New_cases==0)
covid_2 <- covid[-c,]
covid_2$case_fatality <- covid_2$New_deaths/covid_2$New_cases
options(warn = -1)
fig.spaghetti.3 <- covid_2 %>%
filter(Country==c("United States of America", "Brazil","India","Russian Federation","Mexico","Peru","France","the United Kingdom","Yemen","Sudan","Syrian Arab Republic","Egypt")) %>%
mutate(Date=as.Date(Date_reported)) %>%
ggplot(aes(x=Date,y=case_fatality,by=Country)) +
geom_line(aes(color=Country))+
ggtitle("Case fatality rate v.s. Time")
fig.spaghetti.3
vaccine_new <- vaccine %>% dplyr::select(location, stringency_index, population, gdp_per_capita,hospital_beds_per_thousand) %>% filter(location==c("United States", "Brazil","India","Russia","Mexico","Peru","France","United Kingdom","Yemen","Sudan","Syria","Egypt")) %>% group_by(location) %>% summarise(mean_str=mean(stringency_index,na.rm=TRUE),mean_pop=mean(population,na.rm=TRUE),mean_gdp=mean(gdp_per_capita,na.rm=TRUE),mean_beds=mean(hospital_beds_per_thousand,na.rm=TRUE)) %>% arrange(desc(mean_str))
vaccine_new <- vaccine_new[rowSums(is.na(vaccine_new))==0,]
vaccine_new
options(warn = -1)
vaccine_2 <- vaccine %>% dplyr::select(location, stringency_index, population, gdp_per_capita, hospital_beds_per_thousand,new_cases, new_deaths, total_cases, total_deaths) %>% group_by(location) %>% summarise(mean_str=mean(stringency_index,na.rm=TRUE),mean_pop=mean(population,na.rm=TRUE),mean_gdp=mean(gdp_per_capita,na.rm=TRUE),mean_beds=mean(hospital_beds_per_thousand,na.rm=TRUE),mean_cases=mean(new_cases,na.rm=TRUE),mean_deaths=mean(new_deaths,na.rm=TRUE),total_cases=max(total_cases,na.rm = TRUE),total_deaths=max(total_deaths,na.rm = TRUE))
vaccine_2 <- vaccine_2[rowSums(is.na(vaccine_2))==0,]
vaccine_2$case_fatality_rate <- vaccine_2$total_deaths/vaccine_2$total_cases
vaccine_3 <- vaccine_2[,-c(1,8,9)]
m_cor <- cor(vaccine_3)
m_cor
library(tidyverse)
library(lubridate)
library(ggsci)
library(patchwork)
library(forecast)
library(gglm)
vaccine_df <- vaccine_df %>%
mutate(date = ymd(date)) %>%
dplyr::select(date, new_deaths, icu_patients, people_vaccinated_per_hundred,
people_fully_vaccinated_per_hundred, total_boosters_per_hundred) %>%
mutate(vaccinated = ifelse(is.na(people_vaccinated_per_hundred), 0, 1),
people_vaccinated_per_hundred = ifelse(is.na(people_vaccinated_per_hundred), 0, people_vaccinated_per_hundred),
people_fully_vaccinated_per_hundred = ifelse(is.na(people_fully_vaccinated_per_hundred), 0, people_fully_vaccinated_per_hundred),
total_boosters_per_hundred = ifelse(is.na(total_boosters_per_hundred), 0, total_boosters_per_hundred)) %>%
dplyr::filter(!is.na(icu_patients))
vaccine_df %>%
dplyr::select(date, new_deaths, icu_patients) %>%
pivot_longer(-date, names_to = 'Var', values_to = 'Val') %>%
mutate(Var = ifelse(Var == 'new_deaths', 'new deaths', 'ICU patients')) %>%
ggplot(aes(x = date, y = Val, col = Var, gorup = Var)) +
geom_line() +
theme(panel.background = element_blank(),
panel.grid.major = element_blank(),
panel.grid.major.y = element_line(colour = 'grey'),
legend.position = 'top') +
labs(x = '', y = 'Number', col = '') +
scale_x_date(date_labels = "%Y-%m") +
facet_grid(vars(Var), scales = 'free') +
guides(col = 'none') +
scale_color_jama()
vaccine_df %>%
dplyr::select(people_vaccinated_per_hundred, date,
people_fully_vaccinated_per_hundred, total_boosters_per_hundred) %>%
pivot_longer(-date, names_to = 'Var', values_to = 'Val') %>%
ggplot(aes(x = date, y = Val, col = Var, gorup = Var)) +
geom_line() +
scale_color_d3(limits = c('people_vaccinated_per_hundred',
'people_fully_vaccinated_per_hundred', 'total_boosters_per_hundred'),
labels = c('vaccinated', 'fully vaccinated', 'boosters')) +
theme(panel.background = element_blank(),
panel.grid.major = element_blank(),
panel.grid.major.y = element_line(colour = 'grey'),
legend.position = 'top') +
labs(x = '', y = 'people per hundred', col = '') +
scale_x_date(date_labels = "%Y-%m")
p1 <- vaccine_df %>%
dplyr::select(people_vaccinated_per_hundred, icu_patients,
people_fully_vaccinated_per_hundred, total_boosters_per_hundred) %>%
pivot_longer(-icu_patients, names_to = 'Var', values_to = 'Val') %>%
mutate(Var = case_when(Var == 'people_vaccinated_per_hundred'~'vaccinated',
Var == 'people_fully_vaccinated_per_hundred'~'fully vaccinated',
Var == 'total_boosters_per_hundred' ~ 'boosters')) %>%
ggplot(aes(y = icu_patients, x = Val, col = Var)) +
geom_point() +
scale_color_d3() +
theme(panel.background = element_blank(),
panel.grid.major = element_blank(),
panel.grid.major.y = element_line(colour = 'grey'),
legend.position = 'top') +
facet_wrap(vars(Var), scales = 'free') +
labs(y = 'ICU patients', x = 'Number') +
guides(col = 'none')
p2 <- vaccine_df %>%
dplyr::select(people_vaccinated_per_hundred, new_deaths,
people_fully_vaccinated_per_hundred, total_boosters_per_hundred) %>%
pivot_longer(-new_deaths, names_to = 'Var', values_to = 'Val') %>%
mutate(Var = case_when(Var == 'people_vaccinated_per_hundred'~'vaccinated',
Var == 'people_fully_vaccinated_per_hundred'~'fully vaccinated',
Var == 'total_boosters_per_hundred' ~ 'boosters')) %>%
ggplot(aes(y = new_deaths, x = Val, col = Var)) +
geom_point() +
scale_color_d3() +
theme(panel.background = element_blank(),
panel.grid.major = element_blank(),
panel.grid.major.y = element_line(colour = 'grey'),
legend.position = 'top') +
facet_wrap(vars(Var), scales = 'free') +
labs(y = 'new deaths', x = 'Number') +
guides(col = 'none')
p1 / p2
vaccine_df1 <- vaccine_df %>% dplyr::select(-date, vaccinated) %>%
na.omit
colnames(vaccine_df1) <- c('New Deaths', 'ICU Patients', 'People for Vaccinating',
'People for fully Vaccinating', 'Boosters', 'Is Vaccinated')
m_cor <- cor(vaccine_df1)
library(corrplot)
m_sig <- cor.mtest(m_cor)$p
corrplot(corr = m_cor, p.mat = m_sig, order = "AOE", type = "upper",
tl.pos = "t", tl.col = 'black', tl.cex = 0.7, insig = 'blank',
method = 'ellipse')
corrplot(corr = m_cor, p.mat = m_sig, add = TRUE, type = "lower", method = "number", order = "AOE",
diag = FALSE, tl.pos = "n", cl.pos = "n", insig = 'blank')
p1 <- ggplot(vaccine_df, aes(x = new_deaths, y = ..density..)) +
geom_histogram(fill = 'white', col = 'black',bins = 30) +
labs(x = 'new deaths') +
theme_bw()
p2 <- ggplot(vaccine_df, aes(x = icu_patients, y = ..density..)) +
geom_histogram(fill = 'white', col = 'black',bins = 20) +
labs(x = 'ICU patients') +
theme_bw()
p3 <- ggplot(vaccine_df, aes(x = log(new_deaths), y = ..density..)) +
geom_histogram(fill = 'white', col = 'black',bins = 30) +
labs(x = 'log(new deaths)') +
theme_bw()
p4 <- ggplot(vaccine_df, aes(x = log(icu_patients), y = ..density..)) +
geom_histogram(fill = 'white', col = 'black',bins = 30) +
labs(x = 'log(ICU patients)') +
theme_bw()
(p1 + p2)/(p3 + p4)
vaccine_df_1 <- vaccine_df %>%
dplyr::select(new_deaths, icu_patients, date) %>% # select variables
mutate(date = date + 14) %>% # time lag two weeks
# Merge vaccine data with 14-day lag death data
inner_join(vaccine_df %>%
dplyr::select(date, people_vaccinated_per_hundred,
people_fully_vaccinated_per_hundred,
total_boosters_per_hundred, vaccinated)) %>%
mutate(new_deaths = log(new_deaths),
icu_patients = log(icu_patients))
colnames(vaccine_df_1)[7] <- 'Is_Vaccinated'
library(sandwich)
library(car)
mod1 <- lm(log(new_deaths) ~ log(icu_patients) +
people_fully_vaccinated_per_hundred +
total_boosters_per_hundred +
Is_Vaccinated, data = vaccine_df_1)
library(lmtest)
dwtest(mod1)
coeftest(mod1, vcov. = vcovHAC(mod1))
gglm(mod1,
theme = theme(panel.background = element_blank(),
panel.grid.major = element_blank(),
panel.grid.major.y = element_line(colour = 'grey')))
vaccine_df_2 <- vaccine_df %>%
dplyr::select(new_deaths, icu_patients, date) %>% # select variables
mutate(date = date + 21) %>% # time lag three weeks
# Merge vaccine data with 21-day lag death data
inner_join(vaccine_df %>%
dplyr::select(date, people_vaccinated_per_hundred,
people_fully_vaccinated_per_hundred,
total_boosters_per_hundred, vaccinated)) %>%
mutate(new_deaths = log(new_deaths),
icu_patients = log(icu_patients))
colnames(vaccine_df_2)[7] <- 'Is_Vaccinated'
mod2 <- lm(log(new_deaths) ~ log(icu_patients) +
people_fully_vaccinated_per_hundred +
total_boosters_per_hundred +
Is_Vaccinated, data = vaccine_df_2)
dwtest(mod2)
coeftest(mod2, vcov. = vcovHAC(mod2))
gglm(mod1,
theme = theme(panel.background = element_blank(),
panel.grid.major = element_blank(),
panel.grid.major.y = element_line(colour = 'grey')))
mod3 <- lm(log(icu_patients) ~ people_fully_vaccinated_per_hundred +
total_boosters_per_hundred + Is_Vaccinated, data = vaccine_df_1)
dwtest(mod3)
coeftest(mod3, vcov. = vcovHAC(mod3))
gglm(mod3,
theme = theme(panel.background = element_blank(),
panel.grid.major = element_blank(),
panel.grid.major.y = element_line(colour = 'grey')))